Goto

Collaborating Authors

 transformer encoder


64f1f27bf1b4ec22924fd0acb550c235-Paper.pdf

Neural Information Processing Systems

The proposed MLP decoder aggregates information from different layers, andthus combining both local attention and global attention to render powerful representations.



Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models Ziyi Yin 1 Muchao Y e

Neural Information Processing Systems

Vision-Language (VL) pre-trained models have shown their superiority on many multimodal tasks. However, the adversarial robustness of such models has not been fully explored. Existing approaches mainly focus on exploring the adversarial robustness under the white-box setting, which is unrealistic. In this paper, we aim to investigate a new yet practical task to craft image and text perturbations using pre-trained VL models to attack black-box fine-tuned models on different downstream tasks.





TransMatcher: DeepImageMatchingThrough TransformersforGeneralizablePerson Re-identification: Appendix

Neural Information Processing Systems

Some algorithms perform unstably across different runs, thus the average among several runsisamorestablemeasure. Using a unified measure is convenient, concise, and space-saving for ablation study and parameteranalysis. HereH = hand W = w,but to be clear,let'sdenote them differently. Then in Eq. (7), GMP is applied along the last dimension ofhw elements, resulting in a vector of sizeHW. Third, the proposed method has already considered the efficiency,with itssimplified decoder and balanced parameter selection, and thus it is the most efficient one in cross-matching Transformers as shown in Table 2 of the main paper.


Modality-Agnostic Self-Supervised Learning with Meta-Learned Masked Auto-Encoder

Neural Information Processing Systems

Despite its practical importance across a wide range of modalities, recent advances in self-supervised learning (SSL) have been primarily focused on a few well-curated domains, e.g., vision and language, often relying on their domain-specific knowledge. For example, Masked Auto-Encoder (MAE) has become one of the popular architectures in these domains, but less has explored its potential in other modalities. In this paper, we develop MAE as a unified, modality-agnostic SSL framework. In turn, we argue meta-learning as a key to interpreting MAE as a modality-agnostic learner, and propose enhancements to MAE from the motivation to jointly improve its SSL across diverse modalities, coined MetaMAE as a result. Our key idea is to view the mask reconstruction of MAE as a meta-learning task: masked tokens are predicted by adapting the Transformer meta-learner through the amortization of unmasked tokens. Based on this novel interpretation, we propose to integrate two advanced meta-learning techniques. First, we adapt the amortized latent of the Transformer encoder using gradient-based meta-learning to enhance the reconstruction. Then, we maximize the alignment between amortized and adapted latents through task contrastive learning which guides the Transformer encoder to better encode the task-specific knowledge. Our experiment demonstrates the superiority of MetaMAE in the modality-agnostic SSL benchmark (called DABS), significantly outperforming prior baselines.


The Power of Hard Attention Transformers on Data Sequences: A formal language theoretic perspective

Neural Information Processing Systems

Formal language theory has recently been successfully employed to unravel the power of transformer encoders. This setting is primarily applicable in Natural Language Processing (NLP), as a token embedding function (where a bounded number of tokens is admitted) is first applied before feeding the input to the transformer.


Locality-Aware Generalizable Implicit Neural Representation

Neural Information Processing Systems

Generalizable implicit neural representation (INR) enables a single continuous function, i.e., a coordinate-based neural network, to represent multiple data instances by modulating its weights or intermediate features using latent codes. However, the expressive power of the state-of-the-art modulation is limited due to its inability to localize and capture fine-grained details of data entities such as specific pixels and rays. To address this issue, we propose a novel framework for generalizable INR that combines a transformer encoder with a locality-aware INR decoder. The transformer encoder predicts a set of latent tokens from a data instance to encode local information into each latent token. The locality-aware INR decoder extracts a modulation vector by selectively aggregating the latent tokens via cross-attention for a coordinate input and then predicts the output by progressively decoding with coarse-to-fine modulation through multiple frequency bandwidths. The selective token aggregation and the multi-band feature modulation enable us to learn locality-aware representation in spatial and spectral aspects, respectively. Our framework significantly outperforms previous generalizable INRs and validates the usefulness of the locality-aware latents for downstream tasks such as image generation.